School of Computer Sciences, Universiti Sains Malaysia.
Semester 2, 2020/2021
Lee Yong Meng (P-COM0012/20)
Introduction
Section 1: ARIMA
Section 2: SARIMAX
Section 3: Machine Learning
Bike Sharing Demand Data: https://www.kaggle.com/c/bike-sharing-demand/data
Photo by Aaron Doucett on Unsplash
About this dataset (extracted from Kaggle):
Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.
The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.
This dataset consists of 2 files:
train.csv: records of the first 19 days of each month.
test.csv: records of the days from the 20th of each month.
Columns in both train.csv and test.csv:
* datetime: hourly date + timestamp
* season: 1 = spring, 2 = summer, 3 = fall, 4 = winter
* holiday: whether the day is considered a holiday
* workingday: whether the day is neither a weekend nor a holiday
* weather: coded weather condition
* temp: temperature in Celsius
* atemp: "feels like" temperature in Celsius
* humidity: relative humidity
* windspeed: wind speed
Columns in train.csv only:
* casual: number of non-registered user rentals initiated
* registered: number of registered user rentals initiated
* count: number of total rentals
Columns/measurements under consideration: temp, humidity, windspeed
Target column: count
# ================================================================================
# Import Pandas Library
# ================================================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("paper")
sns.set_style("whitegrid")
# sns.set(rc={'figure.figsize':(15, 6)})
%matplotlib inline
plt.rcParams["figure.dpi"] = 200
# ================================================================================
# Training set
# ================================================================================
df_train = pd.read_csv('src/train.csv')
df_train.head()
# Information about the training set
print("> Number of rows and columns in the training set:")
display(df_train.shape)
print("> The general statistics of each column in the training set:")
display(df_train.describe())
display(df_train.describe(include=[object]))
print("> The data type and number of non-null values in each column:\n")
display(df_train.info())
# Check missing value
np.sum(df_train.isna())
Note: There are no missing values in the training set.
Check if count = casual + registered
# Check data validity
np.sum(df_train['count'] != df_train['casual'] + df_train['registered'])
Note: Verified; no row has a count value different from casual + registered.
Therefore, count = casual + registered.
The next thing to do is to explore the correlation between columns/measurements under consideration and count.
(Assume that count is of interest.)
# ================================================================================
# Inspect correlation between each feature to the target
# ================================================================================
FEATURES = ['temp', 'humidity', 'windspeed']
def display_correlation(df, features, target, ascending=False):
    print(f"> Correlation between selected features with target: \"{target}\"")
    display(df[features+[target]].corr()[target].sort_values(ascending=ascending)[1:])
    print("Selected features are:")
    _ = [print(f"* {f}") for f in features]
    print()
display_correlation(df_train, FEATURES, 'count')
display_correlation(df_train, FEATURES, 'casual')
display_correlation(df_train, FEATURES, 'registered')
Note: In every case, the feature temp (temperature) is the most strongly correlated with the assigned target.
These correlations are computed on the training set only; the test set may well follow a somewhat different distribution.
# ================================================================================
# Test set
# ================================================================================
df_test = pd.read_csv('src/test.csv')
df_test
# Information of test set
print("> Number of rows and columns in the test set:")
display(df_test.shape)
print("> The general statistics of each column in the test set:")
display(df_test.describe())
display(df_test.describe(include=[object]))
print("> The data type and number of non-null values in each column:\n")
display(df_test.info())
df_train.head()
# ================================================================================
# Generate time series data
# ================================================================================
FEATURE = 'temp'
DATETIME = 'datetime'
def generate_time_series_data(df, feature, datetime):
    df_new = df.copy()
    df_new[datetime] = pd.to_datetime(df_new[datetime])
    df_new = df_new[[datetime, feature]]
    df_new.set_index(datetime, drop=True, inplace=True)
    return df_new
generate_time_series_data(df_train, FEATURE, DATETIME).head()
print("Successful!")
df_train_ts = generate_time_series_data(df_train, FEATURE, DATETIME)
df_train_ts.describe()
df_test_ts = generate_time_series_data(df_test, FEATURE, DATETIME)
df_test_ts.describe()
# Average Hourly Temperature
df_temp = pd.concat([df_train_ts, df_test_ts])  # DataFrame.append is deprecated
df_temp.sort_index(inplace=True)
df_temp.plot(figsize=(10, 4))
# df_temp.to_csv('output/hourly_temp.csv')
# ================================================================================
# Daily temperature data
# ================================================================================
df_temp_daily = df_temp.resample("d").mean()
ax = df_temp_daily.plot(figsize=(10, 4), title="Average Daily Temperature: Jan 2011 - Dec 2012", legend=False)
ax.set_xlabel('Date')
ax.set_ylabel('Temperature')
print()
# df_temp_daily.to_csv('output/daily_temp.csv')
We use Augmented Dickey-Fuller test to check stationarity of our time series data.
# ================================================================================
# Check stationarity using ADF test
# ================================================================================
# Import adfuller() function from statsmodels package.
from statsmodels.tsa.stattools import adfuller
# Define helper function to check stationarity using ADF test.
def check_stationarity(timeseries):
    result = adfuller(timeseries, autolag='AIC')
    # Generate a Pandas Series using the result of the ADF test.
    dfoutput = pd.Series(result[0:4],
                         index=['Test Statistic',
                                'p-value',
                                'Number of Lags Used',
                                'Number of Observations Used'])
    display(dfoutput)
    print('The test statistic: %f' % result[0])
    print('p-value: %f' % result[1])
    # Print critical values for the test statistic at the 1%, 5%, and 10% levels.
    print('Critical Values:')
    for key, value in result[4].items():
        print('%s: %.3f' % (key, value))
Note: Recall that for the ADF test, $H_0$ is that the series has a unit root (i.e., is non-stationary), and $H_1$ is that the series is stationary; we reject $H_0$ when the $p$-value is below 0.05.
# ================================================================================
# Check stationarity of daily temperature data using the ADF test.
# ================================================================================
check_stationarity(df_temp_daily)
# ================================================================================
# Differencing the daily temperature data
# ================================================================================
first_diff = np.diff(df_temp_daily[FEATURE])
df_first_diff = pd.DataFrame(first_diff, columns=['First Difference'], index=df_temp_daily.index[1:])
plt.figure(figsize=(10, 4))
plt.title("Average Daily Temperature after First Differencing")
plt.plot(df_first_diff, label="First Difference")
plt.xlabel("Date")
plt.ylabel("Temp Difference")
plt.show()
df_first_diff
print("[ADF Test]")
print("=== First Difference: Original ===")
check_stationarity(first_diff)
Note: The $p$-value is less than 0.05, implying that the daily temperature after first differencing is stationary (since we have strong evidence to reject $H_0$).
Therefore, the proper integration order $d$ for the daily temperature data is 1.
We use the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) to identify the values of $q$ and $p$ of the ARIMA model respectively: the lag at which the ACF cuts off suggests the MA order $q$, while the lag at which the PACF cuts off suggests the AR order $p$.
# ================================================================================
# PACF and ACF of the daily temperature data
# ================================================================================
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import statsmodels.api as sm
temp_diff = first_diff
# Set plot figure size
fig = plt.figure(figsize=(10,6))
# Plot ACF from 'df'
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(temp_diff, lags=60, ax=ax1)
# Plot PACF from 'df'
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(temp_diff, lags=60, ax=ax2)
It is difficult to tell from the plots alone which $p$ and $q$ should be used.
Therefore, we use the auto_arima function implemented in Python's pmdarima package to search for the optimal orders of an ARIMA model for the daily temperature data.
Apply Auto ARIMA to the daily temperature data (before first differencing).
# ================================================================================
# ARIMA: Determine p, d, q using Auto ARIMA
# ================================================================================
import pmdarima as pm
auto_arima_fit = pm.auto_arima(df_temp_daily[FEATURE],
# test='adf',
start_p=1, start_q=1,
max_p=7, max_q=7,
seasonal=False, # if True: SARIMA; otherwise: ARIMA
trace=True, # if True: print debugging information; otherwise: no print
error_action='ignore',
suppress_warnings=True,
stepwise=True)
The best ARIMA model for the daily temperature is ARIMA(2,1,1): it has the lowest AIC value.
Note: The lower (more negative) the AIC, the better the model.
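The trade-off behind this choice can be sketched with the textbook formula AIC = 2k − 2 ln L̂, where k is the number of estimated parameters and L̂ the maximized likelihood. The log-likelihood numbers below are hypothetical, purely to illustrate that a larger model must improve the fit enough to offset its parameter penalty:

```python
def aic(log_likelihood, k):
    """Akaike Information Criterion: AIC = 2k - 2*ln(L-hat)."""
    return 2 * k - 2 * log_likelihood

# Hypothetical log-likelihoods for two candidate models: the larger model
# fits slightly better, but not enough to offset its parameter penalty.
aic_small = aic(log_likelihood=-1500.0, k=4)   # e.g. a compact model like ARIMA(2,1,1)
aic_large = aic(log_likelihood=-1498.0, k=11)  # a larger, barely-better model

print(aic_small)  # 3008.0
print(aic_large)  # 3018.0 -> higher AIC, so the smaller model is preferred
```

This is why auto_arima's stepwise search keeps the order whose AIC is smallest rather than the order whose raw fit is best.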
Define the ARIMA(2,1,1) model using the ARIMA class implemented in Python's statsmodels package.
import statsmodels
statsmodels.__version__
# ================================================================================
# ARIMA: Build Model
# ================================================================================
import warnings
warnings.filterwarnings("ignore") # specify to ignore warning messages
from statsmodels.tsa.arima_model import ARIMA
# Generate ARIMA estimator using temperature and specified order: (p, d, q)
mod = ARIMA(df_temp_daily[FEATURE], order=(2, 1, 1), freq='D')
results = mod.fit(disp=0)
# Print the test result of the ARIMA estimator/model
print(results.summary())
A model residual is the difference between the predicted and the observed value, and can be inspected through the fitted model's resid attribute.
The residuals come back as an ndarray, so we store them in a DataFrame for plotting.
In the line plot below we see no unusually large residuals; all stay within their upper and lower limits.
# ================================================================================
# ARIMA: Residual plot
# ================================================================================
residuals = pd.DataFrame(results.resid, columns=['Residuals'])
ax = residuals.plot(figsize=(10, 4), title="Average Daily Temperature: Residual Plot", legend=False)
ax.set_xlabel("Date")
ax.set_ylabel("Residual")
print()
Next we check whether these residuals are normally distributed by plotting their density. The distribution looks roughly Gaussian, centered at zero with a slightly longer left tail.
residuals.plot(kind='kde', figsize=(10, 4))
The mean of the residuals is close to zero, and the PACF plot of the residuals shows no significant correlation.
# ================================================================================
# ARIMA: ACF and PACF of the residual
# ================================================================================
display(residuals.describe())
# plt.figure(figsize=(10, 4))
plt.rc("figure", figsize=(10, 4))
plot_pacf(residuals, lags=40)
plot_acf(residuals, lags=40)
print()
The residuals look like white noise: roughly 95% of the sample autocorrelations fall within the blue shaded region, which meets our criteria for a good forecast.
So let's go ahead and evaluate the predicted against the observed values.
Predict the daily temperatures using the ARIMA model. The actual and predicted temperatures for the first 660 days are plotted for comparison (days_pred).
Note: When the integration order $d$ of the ARIMA model is not 0, the predict() function returns predictions of the $d$-times differenced series.
Therefore, before plotting the temperatures predicted by the ARIMA model, the predicted first differences are added back cumulatively, starting from the first observed value, to recover temperature levels.
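As a sanity check on this reverse transform, here is a minimal sketch on a toy series (using NumPy, with the actual differences standing in for the predicted ones):

```python
import numpy as np

# Toy "levels" series and its first difference.
levels = np.array([10.0, 12.0, 11.0, 15.0])
diffs = np.diff(levels)            # [ 2., -1.,  4.]

# Invert: start from the first observed level and add the differences
# back cumulatively (in the notebook, predicted differences are used).
reconstructed = np.concatenate(([levels[0]], levels[0] + np.cumsum(diffs)))
print(reconstructed)               # [10. 12. 11. 15.] -> matches the original levels
```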
# ================================================================================
# ARIMA: Original and Predicted Plot
# ================================================================================
from math import sqrt
from sklearn.metrics import mean_squared_error
days_pred = 660
plt.figure(figsize=(10, 4))
# Start from the first value of the original data, then add the
# predicted first differences back cumulatively.
temp_pred = [df_temp_daily[FEATURE][0]]
for diff in results.predict().values:
    temp_new = temp_pred[-1] + diff
    temp_pred.append(temp_new)
temp_pred_series = pd.Series(temp_pred[1:], index=results.predict().index)
# print(len(temp_pred))
# Plot temperatures: before reverse transform
plt.plot(df_temp_daily[FEATURE][:days_pred], label='Temperature: Actual', color='r')
# plt.plot(results.predict(1, days_pred) + temp_pred[1:days_pred+1], label='Temperature: Predicted', color='g') # Predict the difference
plt.plot(temp_pred_series[:days_pred], label='Temperature: Predicted', color='g') # Predict the difference
plt.title("Average Daily Temperature: Observed vs Predicted")
plt.xlabel("Date")
plt.ylabel("Temperature")
plt.legend()
plt.show()
Forecast the daily temperature from 1 - 20 Jan 2013 (20 days) using ARIMA(2,1,1) model.
# type(results)
# ================================================================================
# ARIMA: Forecast values
# ================================================================================
# Set the forecast horizon (20 days)
n = 20
# Generate the forecast for the next n days
forecast, err, ci = results.forecast(steps=n)
# Create a new DataFrame using the forecast values from 1 Jan 2013.
df_forecast = pd.DataFrame({'forecast': forecast},
index=pd.date_range(start='1/1/2013', periods=n, freq='D'))
print(df_forecast.head(20))
df_forecast.shape
# ================================================================================
# ARIMA: Forecast Plot
# ================================================================================
# Plot actual data from 1 Sep 2012 onward.
date_sep = '2012-09-01'
num_days_since_sep_01 = df_temp_daily[df_temp_daily.index >= date_sep].shape[0]
ax = df_temp_daily[FEATURE][-num_days_since_sep_01:].plot(label='Temperature: Actual', figsize=(10, 4))
ax = temp_pred_series[-num_days_since_sep_01:].plot(ax=ax, label='Temperature: Predicted')
df_forecast.plot(ax=ax, label='Temperature: Forecast')
# Plot confidence interval (if not NaN, when p-value > 0.05)
ax.fill_between(df_forecast.index,
ci[:, 0], ci[:, 1],
color='g', alpha=.25)
ax.set_xlabel('Date')
ax.set_ylabel('Temperature')
plt.title("Average Daily Temperature: Actual vs Predicted vs Forecast")
plt.legend(['Temperature: Actual', 'Temperature: Predicted', 'Temperature: Forecast'])
plt.show()
Check different types of errors generated by the ARIMA model when predicting daily temperatures.
# ================================================================================
# ARIMA: RMSE and MSE
# ================================================================================
# Load specific evaluation tools
from sklearn.metrics import mean_squared_error, mean_absolute_error
from statsmodels.tools.eval_measures import rmse
# date_sep = '2012-09-01'
# Relative error
# Mean relative error
def mean_relative_error(y_true, y_pred):
    relative_error = np.average(np.abs(y_true - y_pred) / y_true, axis=0)
    return relative_error
# mae_calc = mean_absolute_error(df_temp_daily[FEATURE][1:], temp_pred_series)
# mre_calc = mean_relative_error(df_temp_daily[FEATURE][1:], temp_pred_series)
# mse_calc = mean_squared_error(df_temp_daily[FEATURE][1:], temp_pred_series)
# rmse_calc = rmse(df_temp_daily[FEATURE][1:], temp_pred_series)
num_days = df_temp_daily.shape[0]
num_days_since_sep_01 = df_temp_daily[df_temp_daily.index >= date_sep].shape[0]
temp_true = df_temp_daily[FEATURE][-num_days_since_sep_01:]
temp_pred = temp_pred_series[-num_days_since_sep_01:]
mae_calc = mean_absolute_error(temp_true, temp_pred)
mre_calc = mean_relative_error(temp_true, temp_pred)
mse_calc = mean_squared_error(temp_true, temp_pred)
rmse_calc = rmse(temp_true, temp_pred)
print('MAE: ', mae_calc)
print('MRE: ', mre_calc)
print('MSE: ', mse_calc)
print('RMSE:', rmse_calc)
SARIMAX is a variant of ARIMA that handles seasonal components in time series data. In this assignment, a SARIMAX model is also built to fit the daily temperature data; it is treated here under the broader ARIMA family.
Set m=7 for the daily temperature data, implying a weekly (7-day) seasonality within the daily temperatures.
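To illustrate what seasonal differencing with period m = 7 (the D term) does, here is a minimal sketch on a synthetic two-week series, subtracting from each value the one observed 7 days earlier:

```python
import numpy as np

# Two weeks of a purely weekly pattern; week 2 is a constant 1 degree warmer.
week = np.array([20., 21., 22., 23., 22., 21., 20.])
series = np.concatenate([week, week + 1.0])

# Seasonal differencing with period m = 7: x[t] - x[t-7].
seasonal_diff = series[7:] - series[:-7]
print(seasonal_diff)   # [1. 1. 1. 1. 1. 1. 1.] -> the weekly pattern is removed
```

After this transform only the week-over-week change remains, which is what the seasonal part of the SARIMAX model then models.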
# ================================================================================
# SARIMAX: Determine orders using Auto ARIMA
# ================================================================================
Sarimax_model = pm.auto_arima(df_temp_daily[FEATURE],
start_P=1, start_q=1,
max_p=3, max_q=3,
m=7,
seasonal=True,
d=None,
D=1,
trace=True,
error_action='ignore',
suppress_warnings=True,
stepwise=True)
The best SARIMAX model for daily temperature (for m=7) is SARIMAX(2,0,1)(0,1,1,7) - again, it has the lowest AIC value.
Define the SARIMAX(2,0,1)(0,1,1,7) model using the SARIMAX class implemented in Python's statsmodels package, with:
order=(2, 0, 1)
seasonal_order=(0, 1, 1, 7)
# ================================================================================
# SARIMAX: Build model
# ================================================================================
import statsmodels.api as sm
from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(df_temp_daily[FEATURE], order=(2, 0, 1),
seasonal_order=(0, 1, 1, 7),
enforce_stationarity=False,
enforce_invertibility=False)
results = model.fit()
print(results.summary())
Using the same steps as in ARIMA Step 5, the residual, ACF, and PACF plots of the SARIMAX model are produced.
# ================================================================================
# SARIMAX: Residual plot
# ================================================================================
residuals = pd.DataFrame(results.resid, columns=['Residual'])
ax = residuals.plot(figsize=(10, 4), title="Average Daily Temperature: Residual Plot", legend=False)
ax.set_xlabel('Date')
ax.set_ylabel('Residual')
residuals.plot(kind='kde')
plt.rc("figure", figsize=(10, 4))
plot_pacf(residuals, lags=40)
plot_acf(residuals, lags=40)
print()
The residuals look like white noise: roughly 95% of the sample autocorrelations fall within (or very close to) the blue shaded region, which meets our criteria for a good forecast.
Predict the daily temperatures using the SARIMAX model. The actual and predicted temperatures for the first 660 days are plotted for comparison (days_pred).
# ================================================================================
# SARIMAX: Original and Predicted Plot
# ================================================================================
days_pred = 660
plt.figure(figsize=(10, 4))
plt.plot(df_temp_daily[FEATURE][:days_pred], label='Temperature: Actual', color='r')
plt.plot(results.predict(0, days_pred), label='Temperature: Predicted', color='g' )
plt.title("Average Daily Temperature: Actual vs Predicted")
plt.xlabel("Date")
plt.ylabel("Temperature")
plt.legend()
plt.show()
Forecast the daily temperature from 1 - 24 Jan 2013 (24 days) using the SARIMAX model.
# ================================================================================
# SARIMAX: Forecast values
# ================================================================================
forecast = results.predict(start=len(df_temp_daily[FEATURE]),
                           end=len(df_temp_daily[FEATURE]) + 23,
                           typ='levels').rename('SARIMAX(2,0,1)(0,1,1,7) forecast')
# ================================================================================
# SARIMAX: Forecast plot
# ================================================================================
date_sep = '2012-09-01'
num_days_since_sep_01 = df_temp_daily[df_temp_daily.index >= date_sep].shape[0]
num_days = df_temp_daily.shape[0]
ax = df_temp_daily[FEATURE][-num_days_since_sep_01:].plot(figsize=(10, 4), legend=True, label='Temperature: Actual')
ax = results.predict(num_days-num_days_since_sep_01, num_days).plot(ax=ax, legend=True, label='Temperature: Predicted')
ax = forecast.plot(ax=ax, legend=True, label='Temperature: Forecast')
ax.set_title("Average Daily Temperature: Actual vs Predicted vs Forecast")
ax.set_xlabel('Date')
ax.set_ylabel('Temperature')
print()
Check the different types of errors generated by the SARIMAX model when predicting daily temperatures.
# ================================================================================
# SARIMAX: Error
# ================================================================================
# Load specific evaluation tools
from sklearn.metrics import mean_squared_error, mean_absolute_error
from statsmodels.tools.eval_measures import rmse
num_days = df_temp_daily.shape[0]
num_days_since_sep_01 = df_temp_daily[df_temp_daily.index >= date_sep].shape[0]
temp_true = df_temp_daily[FEATURE][-num_days_since_sep_01:]
temp_pred = results.predict(num_days-num_days_since_sep_01, num_days-1)
mae_calc = mean_absolute_error(temp_true, temp_pred)
mre_calc = mean_relative_error(temp_true, temp_pred)
mse_calc = mean_squared_error(temp_true, temp_pred)
rmse_calc = rmse(temp_true, temp_pred)
print('MAE: ', mae_calc)
print('MRE: ', mre_calc)
print('MSE: ', mse_calc)
print('RMSE:', rmse_calc)
In this section, different machine learning models are implemented and their performances on time series prediction tasks are evaluated using different error metrics.
Finally, the best model is applied to perform the forecasting task on the daily temperature data.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import datetime
import seaborn as sns
from numpy import asarray
from pandas import concat
from pandas import DataFrame
# sns.set(rc={'figure.figsize':(10, 6)})
sns.set_context("paper")
sns.set_style("whitegrid")
# sns.set(rc={'figure.figsize':(15, 6)})
%matplotlib inline
plt.rcParams["figure.dpi"] = 200
df = pd.read_csv('output/daily_temp.csv', index_col='datetime')
df.index = pd.to_datetime(df.index)
df.head()
Refresh: Define temp as the target column, then plot the daily temperature data (i.e., time series data).
FEATURE = 'temp'
df[FEATURE].plot(figsize=(10, 4))
Transform the time series prediction and forecasting tasks into regression tasks by defining new features using rolling window technique.
Generate the 7-day average temperature values using the daily temperature readings from the past 7 days. Define new column temp_7_day_mean to store these values.
#7-day rolling mean
df.rolling(7).mean()[0:21]
df['temp_7_day_mean'] = df[FEATURE].rolling(window=7).mean()
# df_7day = pd.DataFrame(df[[FEATURE, 'temp_7_day_mean']], index=df.index)
ax = df[[FEATURE, 'temp_7_day_mean']].plot(figsize=(10, 4))
# ax = df_7day.plot(figsize=(10, 4))
ax.set_title("Average Daily Temperature: Actual vs 7-Day Rolling Windows")
ax.set_xlabel("Date")
ax.set_ylabel("Temperature")
ax.legend(['Temperature: Actual', 'Temperature: 7-Day Rolling Windows'])
print()
Generate the 30-day average temperature values using the daily temperature readings from the past 30 days. Define new column temp_30_day_mean to store these values.
df['temp_30_day_mean'] = df[FEATURE].rolling(window=30).mean()
ax = df[[FEATURE, 'temp_30_day_mean']].plot(figsize=(10, 4))
ax.set_title("Average Daily Temperature: Actual vs 30-Day Rolling Windows")
ax.set_xlabel("Date")
ax.set_ylabel("Temperature")
ax.legend(['Temperature: Actual', 'Temperature: 30-Day Rolling Windows'])
print()
df
Selected features are:
* temp_7_day_mean
* temp_30_day_mean
which are generated using the rolling window technique. Then, split the data into training and test sets: records dated before 2012-09-01 form the training set, and records on or after that date form the test set.
#assigning the train, test, label and features.
features = ['temp_7_day_mean', 'temp_30_day_mean'] #features used
label = FEATURE # label
date_sep = '2012-09-01'
test_df = df[df.index >= date_sep] # index for test data
train_df = df[df.index < date_sep] # index for train data
train_df = train_df.dropna()
X_train, y_train = train_df[features], train_df[label] #assign train data
X_test, y_test = test_df[features], test_df[label] # assign test data
Define helper functions to calculate errors and plot actual and predicted values using Machine Learning models.
# Define helper functions
from sklearn import metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
def calculate_errors(regressor, x_test, y_test):
    y_pred = regressor.predict(x_test)
    mae_calc = mean_absolute_error(y_test, y_pred)
    mre_calc = mean_relative_error(y_test, y_pred)
    mse_calc = mean_squared_error(y_test, y_pred)
    rmse_calc = rmse(y_test, y_pred)
    print('MAE: ', mae_calc)
    print('MRE: ', mre_calc)
    print('MSE: ', mse_calc)
    print('RMSE:', rmse_calc)
def plot_actual_and_predicted_values(df, regressor, x_test, y_test):
    predictions = regressor.predict(x_test)
    df['predictions'] = predictions
    ax = df[[FEATURE, 'predictions']].plot(figsize=(10, 4))
    ax.set_title("Average Daily Temperature: Actual vs Predicted")
    ax.set_xlabel("Date")
    ax.set_ylabel("Temperature")
    ax.legend(['Temperature: Actual', 'Temperature: Predicted'])
    print()
## XGBRegressor with learning_rate = 0.01
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
# # Decision Tree Regressor
# reg1 = DecisionTreeRegressor(random_state=0)
# reg1.fit(X_train, y_train)
# # Bagging Regressor
# reg1 = BaggingRegressor(DecisionTreeRegressor(), random_state=0, n_estimators=10)
# reg1.fit(X_train, y_train)
# XGBRegressor
reg1 = XGBRegressor(n_estimators=100, learning_rate=0.01)
reg1.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_test, y_test)],
eval_metric='mae')
print("\nCalculated errors:")
calculate_errors(reg1, X_test, y_test)
plot_actual_and_predicted_values(test_df, reg1, X_test, y_test)
## XGBRegressor with learning_rate = 0.1
reg2 = XGBRegressor(n_estimators=100, learning_rate=0.1)
reg2.fit(X_train,
y_train,
eval_set=[(X_train, y_train), (X_test, y_test)],
eval_metric='mae')
print("\nCalculated errors:")
calculate_errors(reg2, X_test, y_test)
plot_actual_and_predicted_values(test_df, reg2, X_test, y_test)
## XGBRegressor with learning_rate = 0.3
reg3 = XGBRegressor(n_estimators=100, learning_rate=0.3)
reg3.fit(X_train,
y_train,
eval_set=[(X_train, y_train), (X_test, y_test)],
eval_metric='mae')
print("\nCalculated errors:")
calculate_errors(reg3, X_test, y_test)
plot_actual_and_predicted_values(test_df, reg3, X_test, y_test)
# Decision Tree Regressor
reg4 = DecisionTreeRegressor(random_state=0)
reg4.fit(X_train, y_train)
print("\nCalculated errors:")
calculate_errors(reg4, X_test, y_test)
plot_actual_and_predicted_values(test_df, reg4, X_test, y_test)
# Bagging Regressor
reg5 = BaggingRegressor(DecisionTreeRegressor(), random_state=0, n_estimators=10)
reg5.fit(X_train, y_train)
print("\nCalculated errors:")
calculate_errors(reg5, X_test, y_test)
plot_actual_and_predicted_values(test_df, reg5, X_test, y_test)
The best model is Model 5, the bagging model with a decision tree as the base estimator.
Its predictions yield the smallest errors between the actual and the predicted daily temperatures.
Calculated errors:
MAE: 2.086231543877837
MRE: 0.12516884716291504
MSE: 6.655657701832612
RMSE: 2.579856139755202
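As a rough intuition for why bagging wins here: averaging several independent high-variance predictors shrinks the spread of the combined prediction by about a factor of 1/sqrt(n). The sketch below uses synthetic numbers (not the temperature data) to show the effect:

```python
import random
import statistics

random.seed(0)

# Each "tree" predicts the true value 25.0 plus independent noise.
def noisy_prediction():
    return 25.0 + random.gauss(0.0, 2.0)

# Single-predictor errors vs. the mean of 10 predictors (a bagged ensemble).
single = [noisy_prediction() for _ in range(1000)]
bagged = [statistics.mean(noisy_prediction() for _ in range(10))
          for _ in range(1000)]

# The ensemble's spread is roughly 1/sqrt(10) of the single predictor's.
print(statistics.stdev(single))
print(statistics.stdev(bagged))
```

Real bagged trees are not fully independent (they share training data), so the reduction is smaller in practice, but the direction of the effect is the same.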
predictions = reg5.predict(X_test)
df.info()
test_df['predictions'] = predictions
test_df.info()
plot_actual_and_predicted_values(test_df, reg5, X_test, y_test)
test_df
Define helper function to transform a time series data into the format suitable for supervised learning tasks.
# ================================================================================
# Define helper function to transform a time series data into a supervised
# ... learning data.
# ================================================================================
def series_to_supervised(data, n_in=2, n_out=1, dropnan=True):
    df = DataFrame(data)
    cols = list()
    # input sequence (t-n, ..., t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
    # forecast sequence (t, t+1, ..., t+n_out-1)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
    # put it all together
    agg = concat(cols, axis=1)
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg.values
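As a quick sanity check, the transform can be applied to a toy series (the function is re-defined inline below in a minimal form so the snippet is self-contained):

```python
import pandas as pd

def series_to_supervised(data, n_in=2, n_out=1, dropnan=True):
    # Shift the series to build lagged input columns (t-n .. t-1),
    # followed by the output columns (t .. t+n_out-1).
    df = pd.DataFrame(data)
    cols = [df.shift(i) for i in range(n_in, 0, -1)]
    cols += [df.shift(-i) for i in range(n_out)]
    agg = pd.concat(cols, axis=1)
    if dropnan:
        agg.dropna(inplace=True)
    return agg.values

# Each row reads [x(t-2), x(t-1), x(t)]: the inputs plus the value to predict.
rows = series_to_supervised([1, 2, 3, 4, 5], n_in=2, n_out=1)
print(rows)
# [[1. 2. 3.]
#  [2. 3. 4.]
#  [3. 4. 5.]]
```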
Forecast the daily temperature from 1 - 10 Jan 2013 (10 days) using the best Machine Learning model.
# ================================================================================
# Since we have built the model, at this stage we use the whole dataset
# ... again, but select only the temperature column.
# ... This is because we only want to forecast the temperature.
# ================================================================================
values = df[FEATURE].values # select column temperature.
preds = []
# ================================================================================
# Define the number of forecast steps. 10 means 10 new values will be
# ... forecasted. You can change it accordingly.
# ================================================================================
num_days_pred = 10
for i in range(num_days_pred):
    # Transform the time series data into supervised learning format.
    # n_in=6 means the previous 6 values are used to forecast the next value.
    forecast = series_to_supervised(values, n_in=6)
    # split into input and output columns
    forecastI, forecastO = forecast[:, :-1], forecast[:, -1]
    # fit model: re-use the model configuration chosen earlier.
    # model = XGBRegressor(n_estimators=100, learning_rate=0.1)
    model = BaggingRegressor(DecisionTreeRegressor(), random_state=0, n_estimators=10)
    model.fit(forecastI, forecastO)
    # construct an input for a new prediction
    row = values[-6:].flatten()
    # make a prediction
    yhat = model.predict(asarray([row]))
    print('Input: %s, Predicted: %.3f' % (row, yhat[0]))
    values = np.append(values, yhat)
    preds.append(yhat)
import datetime
df_preds = pd.DataFrame(preds, index=df.index[-len(preds):] + datetime.timedelta(days=num_days_pred))
ax = df[df.index >= date_sep][FEATURE].plot(figsize=(10, 4), legend=True, label='Temperature: Actual')
ax = test_df['predictions'].plot(ax=ax, label='Temperature: Predicted')
ax = df_preds.plot(ax=ax, legend=True, label='Temperature: Forecast')
ax.set_title("Average Daily Temperature: Actual vs Predicted vs Forecast")
ax.set_xlabel("Date")
ax.set_ylabel("Temperature")
ax.legend(['Temperature: Actual', 'Temperature: Predicted', 'Temperature: Forecast'])
print()